A UNIVERSAL PROCEDURE FOR AGGREGATING ESTIMATORS

By Alexander Goldenshluger

University of Haifa 

In this paper we study the aggregation problem that can be formulated as follows. Assume that we have a family of estimators $\mathcal{F}$ built on the basis of available observations. The goal is to construct a new estimator whose risk is as close as possible to that of the best estimator in the family. We propose a general aggregation scheme that is universal in the following sense: it applies to families of arbitrary estimators and a wide variety of models and global risk measures. The procedure is based on comparison of empirical estimates of certain linear functionals with estimates induced by the family $\mathcal{F}$. We derive oracle inequalities and show that they are unimprovable in some sense. Numerical results demonstrate good practical behavior of the procedure.

1. Introduction. The subject of this paper is the problem of aggregating 
estimators from a given collection. 

Consider the Gaussian white noise model 

(1) $Y_\varepsilon(dt) = f(t)\,dt + \varepsilon W(dt), \quad t = (t_1, \ldots, t_d) \in \mathcal{D}_0 = [0,1]^d,$

where $f : \mathbb{R}^d \to \mathbb{R}$ is an unknown function, $\varepsilon \in (0,1)$ and $W$ is the standard Wiener process in $\mathbb{R}^d$. Let $\Theta$ be a compact set, and assume that we are given a parameterized family of estimators $\mathcal{F}_\Theta = \{f_\theta, \theta \in \Theta\}$ of $f$. The objective is, using the observation $\mathcal{Y}_\varepsilon = \{Y_\varepsilon(t), t \in \mathcal{D}_0\}$, to select a single estimator from $\mathcal{F}_\Theta$ with the risk that is as close as possible to the risk of the best estimator in the family $\mathcal{F}_\Theta$. We refer to the outlined setup as the aggregation problem. Aggregation is a common approach to construction of nonparametric adaptive estimators; this fact motivates consideration of aggregation problems.

Typically aggregation procedures involve splitting the sample into two sub-samples: the candidate estimators are constructed on the basis of the first sub-sample, while the second sub-sample is used for the aggregation purposes. In this work we focus on the aggregation step only, and following Juditsky and Nemirovski (2000), Nemirovski (2000) and Tsybakov (2003) we regard the estimators $f_\theta$, $\theta \in \Theta$, as known fixed functions on $\mathcal{D}_0$.

The following two types of aggregation are frequently discussed in the 
literature: 

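(i) Model selection (MS) aggregation. Here $\Theta = I_N := (1, \ldots, N)$, and the corresponding set of estimators is $\mathcal{F}_\Theta = \mathcal{F}_{I_N} := \{f_i, i \in I_N\}$, where $f_i$ are distinct fixed functions.

(ii) Convex aggregation. Here

(2) $\Theta = \Lambda := \Bigl\{\lambda \in \mathbb{R}^N \Bigm| \lambda_i \ge 0, \ \sum_{i=1}^N \lambda_i \le 1\Bigr\},$

and for fixed estimators $f_i$, $i \in I_N$,

$\mathcal{F}_\Theta = \mathcal{F}_\Lambda := \Bigl\{F_\lambda \Bigm| F_\lambda(t) := \sum_{i=1}^N \lambda_i f_i(t), \ \lambda \in \Lambda\Bigr\}.$

Let $\hat f$ be an estimator of $f$ based on the observation $\mathcal{Y}_\varepsilon$. We measure accuracy of $\hat f$ by its $L_p$-risk

$\mathcal{R}_p[\hat f; f] := E_f \|\hat f - f\|_p, \quad 1 \le p \le \infty,$

where $E_f$ is the expectation with respect to the probability measure $P_f$ of observation $\mathcal{Y}_\varepsilon$ under model (1), and $\|\cdot\|_p$ is the standard $L_p$-norm on $\mathcal{D}_0$. We want to propose a measurable choice, say $\hat f = f_{\hat\theta}$, from collection $\mathcal{F}_\Theta$ such that the following $L_p$-risk oracle inequality holds:

(3) $\mathcal{R}_p[\hat f; f] \le C \inf_{\theta \in \Theta} \mathcal{R}_p[f_\theta; f] + r_\varepsilon$

for all $f$ from a "large" functional class. Here $C$ is a constant independent of $f$ and $\varepsilon$, and $r_\varepsilon$ is a remainder term that does not depend on $f$.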

The outlined aggregation problem has attracted much attention in the literature for the regression and Gaussian white noise models. Remarkable progress has been achieved in the framework of $L_2$-theory where exact oracle inequalities [with $C = 1$ or $C = 1 + o(1)$, $\varepsilon \to 0$] were derived for collections of arbitrary estimators; see Juditsky and Nemirovski (2000), Nemirovski (2000), Tsybakov (2003). Tsybakov (2003) introduced the notion of optimal rates of aggregation and derived aggregation procedures possessing (3) with smallest possible, in a minimax sense, remainder term $r_\varepsilon$. $L_2$-risk oracle inequalities with $C > 1$ for arbitrary estimators were obtained, for example, by Yang (2001, 2004), Wegkamp (2003) and Bunea, Tsybakov and Wegkamp (2007).

Aggregation of arbitrary nonparametric estimators with respect to other 
loss functions is much less studied. Catoni (2004) and Yang (2000) considered the problem of aggregating density estimators with the Kullback-Leibler divergence as a loss function. Devroye and Lugosi (1996, 1997, 2001) developed $L_1$-risk oracle inequalities in the context of density estimation; see
also Hengartner and Wegkamp (2001) who apply the approach of Devroye 
and Lugosi for the regression setup. Our results are closely related to those 
by Devroye and Lugosi, and we discuss this connection in detail in Section 3. 

For a detailed account of the literature on aggregation of estimators see 
the recent papers Audibert (2004), Birgé (2006), Bunea, Tsybakov and Wegkamp
(2007), Juditsky, Rigollet and Tsybakov (2008) and references therein. It is 
also worth noting that there is vast literature on aggregation of estimators 
from restricted families (such as orthogonal series estimators, kernel esti- 
mators, etc.), and aggregation of classifiers in classification problems. A list 
of representative publications from this literature includes Kneip (1994), 
Lepski and Spokoiny (1997), Cavalier et al. (2002), Koltchinskii (2006) and 
Lecué (2007), where further references can be found.

In this paper we propose a general aggregation scheme that is universal in 
the following sense: (i) it applies to families of arbitrary estimators; (ii) it can 
be easily extended to different models; (iii) it can be used for a wide variety 
of global risk measures. Although the main results of this paper pertain to 
the MS aggregation setup, Gaussian white noise model and $L_p$-risks, similar
results can be easily established for other models and global risk measures. 
In Section 4 we illustrate universality of the suggested procedure by applying 
it to convex aggregation and to the problem of estimating a normal mean 
vector. 

Our aggregation method is based on comparison of empirical estimates of certain regular linear functionals with estimates induced by the family $\mathcal{F}_\Theta$. A closely related idea, that a nonparametric function estimator is "good" if its integrals over cubes "agree" with the corresponding empirical means, belongs to Nemirovski (1985). We establish general oracle inequalities and specialize them for different sets of linear functionals. It turns out that universal inequalities of Devroye and Lugosi (1996, 1997, 2001) and Hengartner and Wegkamp (2001) can be derived from our general oracle inequalities using a specific choice of the set of linear functionals. The results indicate that in the Gaussian white noise model (1) the problem of aggregation of arbitrary estimators in $L_p$, $p \in (2, \infty]$, can be rather difficult. In this case remainder terms in the oracle inequalities depend on the family $\mathcal{F}_\Theta$ and, in general, can be rather large. We prove a lower bound and show that dependence of the remainder terms on $\mathcal{F}_\Theta$ is, in a sense, unavoidable. Thus "efficient" aggregation of arbitrary estimators in $L_p$, $p \in (2, \infty]$, is impossible. We also show that in the $L_2$-framework a slight modification of the proposed aggregation procedure satisfies the exact oracle inequality (3) with $C = 1$ and the remainder $r_\varepsilon$ that cannot be improved in the minimax sense.

The rest of the paper is organized as follows. In Section 2 we introduce 
our aggregation scheme. Section 3 contains the main results of the paper. 
In Section 4 we apply the procedure to convex aggregation and estimation 
of a normal mean vector. In a simulation experiment of Section 4 we study 
performance of our procedure for estimating a normal mean vector. Proofs 
are given in Section 5. 

2. Aggregation scheme. We begin with construction of the aggregation 
scheme for the Gaussian white noise model (1). 

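2.1. Construction. Let $\Psi$ be a set of probe functions $\psi : \mathcal{D}_0 \to \mathbb{R}$. Consider a linear functional

$\ell_f(\psi) = \int \psi(t) f(t)\,dt, \quad \psi \in \Psi.$

For given $\psi \in \Psi$, a natural estimator of $\ell_f(\psi)$ based on observation $\mathcal{Y}_\varepsilon$ is

$\hat\ell_f(\psi) = \int \psi(t)\, Y_\varepsilon(dt).$

On the other hand, $\ell_f(\psi)$ can be estimated using estimates $f_\theta \in \mathcal{F}_\Theta$:

$\ell_{f_\theta}(\psi) = \int \psi(t) f_\theta(t)\,dt, \quad \theta \in \Theta.$

Define

(4) $\Delta_\theta(\psi) := \hat\ell_f(\psi) - \ell_{f_\theta}(\psi) = \int \psi(t)[f(t) - f_\theta(t)]\,dt + \varepsilon \int \psi(t)\, W(dt) =: \int \psi(t)[f(t) - f_\theta(t)]\,dt + \varepsilon Z(\psi), \quad \theta \in \Theta.$

For any fixed $\theta \in \Theta$, $\Delta_\theta(\psi)$ is a random variable that measures discrepancy between empirical estimate $\hat\ell_f(\psi)$ of the linear functional $\ell_f(\psi)$ and the estimate $\ell_{f_\theta}(\psi)$ induced by $f_\theta \in \mathcal{F}_\Theta$. The idea underlying construction of our aggregation rule is that, for a "good" estimator $f_\theta$, the absolute value of $\Delta_\theta(\psi)$ "corrected" for a random error $Z(\psi)$ should be uniformly small for all $\psi \in \Psi$.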

Let $\delta \in (0,1)$, and

(5) $\varkappa = \varkappa(\delta, \Psi) := \min\Bigl\{x > 0 \Bigm| P\Bigl[\sup_{\psi \in \Psi} \frac{|Z(\psi)|}{\|\psi\|_2} \ge x\Bigr] \le \delta\Bigr\}.$

Define

(6) $\hat M_\theta := \sup_{\psi \in \Psi} \Bigl\{\frac{1}{\|\psi\|_q}\bigl[|\Delta_\theta(\psi)| - \varepsilon\varkappa\|\psi\|_2\bigr]\Bigr\},$

where $1/p + 1/q = 1$, and let $\hat\theta := \arg\inf_{\theta \in \Theta} \hat M_\theta$; then our estimator is given by

(7) $\hat f = f_{\hat\theta}.$

Recently a procedure based on different ideas but close in spirit to (6)-(7) was used in Goldenshluger and Lepski (2007) for selection of kernel estimators from large parameterized collections.

In order to ensure that the estimator $\hat f$ is well defined, certain conditions on the set of probe functions $\Psi$ and on the family of estimators $\mathcal{F}_\Theta$ have to be imposed. First, to guarantee that $\varkappa$ is well defined in (5), we need appropriate assumptions on the intrinsic semimetric of the zero-mean Gaussian process $\{Z(\psi), \psi \in \Psi\}$. Second, $\hat\theta$ should be measurable; this requirement calls for conditions on the sample paths of the random process $\{\hat M_\theta, \theta \in \Theta\}$. Although general conditions that guarantee fulfillment of the above properties can be explicitly stated, for the present we will take them for granted. In the aggregation setups of Sections 3 and 4 these conditions are trivially fulfilled.

Note that the aggregation procedure requires specification of the parameter $\delta$ and the set of probe functions $\Psi$. The choice of $\Psi$ is a crucial step in construction. We discuss this issue below.

2.2. The set of probe functions. The following norm approximation property of the set of probe functions plays an important role in our construction.

Definition 1. Given the collection of estimators $\mathcal{F}_\Theta = \{f_\theta, \theta \in \Theta\}$ with index set $\Theta$, let

(8) $\mathcal{G}_\Theta := \{g : \mathcal{D}_0 \to \mathbb{R} \mid g = g_{\tau,\nu} := f_\tau - f_\nu, \ f_\tau, f_\nu \in \mathcal{F}_\Theta, \ f_\tau \ne f_\nu\}.$

Let $\Psi$ be a set of functions on $\mathcal{D}_0$, $\gamma \ge 0$ and $p \in [1, \infty]$. We say that $\Psi$ is a $(\gamma,p)$-good set with respect to $\mathcal{G}_\Theta$ if for any $g \in \mathcal{G}_\Theta$ there exists $\psi_g \in \Psi$ such that

(9) $\Bigl|\int \psi_g(t) g(t)\,dt - \|g\|_p\Bigr| \le \gamma.$
Several remarks on the above definition are in order. The set $\mathcal{G}_\Theta$ contains pairwise differences of estimators from $\mathcal{F}_\Theta$. The set of probe functions $\Psi$ is $(\gamma,p)$-good with respect to $\mathcal{G}_\Theta$ if the $L_p$-norm of any function from $\mathcal{G}_\Theta$ can be approximated by a linear functional from $\Psi$ with prescribed guaranteed accuracy $\gamma$. Since $\mathcal{G}_\Theta$ is indexed by $(\tau, \nu) \in \Theta \times \Theta$, the corresponding $(\gamma,p)$-good set of probe functions can always be chosen indexed by $(\tau, \nu) \in \Theta \times \Theta$, too. Specifically, the $(\gamma,p)$-good set with respect to $\mathcal{G}_\Theta$ can be chosen as follows:

(10) $\Psi = \Psi_\Theta := \{\psi : \mathcal{D}_0 \to \mathbb{R} \mid \psi = \psi_{g_{\tau,\nu}}, \ \tau, \nu \in \Theta, \ \tau \ne \nu\},$

where $\psi_{g_{\tau,\nu}}$ is the representer corresponding to $g_{\tau,\nu} := f_\tau - f_\nu$ such that (9) is fulfilled. In all that follows without further mention we always write $\Psi_\Theta$ for a set of probe functions that is associated with $\mathcal{F}_\Theta$ (and $\mathcal{G}_\Theta$) via (10).

The $(\gamma,p)$-good sets of probe functions are easily constructed. In the sequel the following examples of the $(\gamma,p)$-good sets will be particularly important.

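Example 1. Let $p \in [1, \infty)$ and define

(11) $\tilde\Psi_\Theta := \Bigl\{\psi_g(\cdot) = \frac{|g(\cdot)|^{p-1}\,\mathrm{sign}\{g(\cdot)\}}{\|g\|_p^{p-1}}, \ g \in \mathcal{G}_\Theta\Bigr\}.$

Clearly, $\tilde\Psi_\Theta$ is $(0,p)$-good with respect to $\mathcal{G}_\Theta$. Note also that $\tilde\Psi_\Theta \subset \{\psi : \|\psi\|_q = 1\}$.

Example 2. The set

(12) $\bar\Psi_\Theta := \Bigl\{\psi_g(\cdot) = \frac{\|g\|_p\, g(\cdot)}{\|g\|_2^2}, \ g \in \mathcal{G}_\Theta\Bigr\}$

is $(0,p)$-good with respect to $\mathcal{G}_\Theta$ for any $p \in [1, \infty]$.

Example 3. For $\gamma > 0$ define

$\check\Psi_\Theta(\gamma) := \Bigl\{\psi_g(\cdot) = \frac{[|g(\cdot)| - \|g\|_\infty + \gamma]_+\,\mathrm{sign}\{g(\cdot)\}}{\int [|g(t)| - \|g\|_\infty + \gamma]_+\,dt}, \ g \in \mathcal{G}_\Theta\Bigr\},$

where $[\cdot]_+ = \max\{\cdot, 0\}$. It is easily verified that $\check\Psi_\Theta(\gamma)$ is $(\gamma, \infty)$-good with respect to $\mathcal{G}_\Theta$; moreover, $\check\Psi_\Theta(\gamma) \subset \{\psi : \|\psi\|_1 = 1\}$.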
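The norm approximation property (9) of the three examples can be checked numerically. The following is a minimal sketch, assuming functions on $[0,1]$ are represented by their values on a uniform grid and integrals are replaced by Riemann sums; the helper names (`lp_norm`, `inner`, `probe_example1`, and so on) and the test function `g` are illustrative choices, not part of the paper.

```python
import math

def lp_norm(g, p):
    """Discretized L_p norm on [0,1] for a function sampled on a uniform grid."""
    if p == math.inf:
        return max(abs(v) for v in g)
    return (sum(abs(v) ** p for v in g) / len(g)) ** (1.0 / p)

def inner(psi, g):
    """Riemann-sum approximation of the integral of psi(t) g(t) over [0,1]."""
    return sum(a * b for a, b in zip(psi, g)) / len(g)

def sign(v):
    return (v > 0) - (v < 0)

def probe_example1(g, p):
    """Example 1: psi_g = |g|^(p-1) sign(g) / ||g||_p^(p-1), 1 <= p < infinity."""
    scale = lp_norm(g, p) ** (p - 1)
    return [abs(v) ** (p - 1) * sign(v) / scale for v in g]

def probe_example2(g, p):
    """Example 2: psi_g = ||g||_p g / ||g||_2^2, 1 <= p <= infinity."""
    c = lp_norm(g, p) / lp_norm(g, 2) ** 2
    return [c * v for v in g]

def probe_example3(g, gamma):
    """Example 3: psi_g = [|g| - ||g||_inf + gamma]_+ sign(g) / integral of [...]_+."""
    sup = lp_norm(g, math.inf)
    w = [max(abs(v) - sup + gamma, 0.0) for v in g]
    total = sum(w) / len(g)
    return [wi * sign(v) / total for wi, v in zip(w, g)]

# Hypothetical difference g of two estimators, sampled at 1000 grid midpoints.
grid = [(k + 0.5) / 1000 for k in range(1000)]
g = [math.sin(2 * math.pi * t) + 0.3 * t for t in grid]

p = 3.0
q = p / (p - 1)          # conjugate exponent
gamma = 0.05
psi1 = probe_example1(g, p)
psi2 = probe_example2(g, p)
psi3 = probe_example3(g, gamma)

gap1 = abs(inner(psi1, g) - lp_norm(g, p))          # zero up to rounding
gap2 = abs(inner(psi2, g) - lp_norm(g, p))          # zero up to rounding
gap3 = abs(inner(psi3, g) - lp_norm(g, math.inf))   # at most gamma
```

In this discretization the first two gaps vanish up to floating-point rounding, while the third is bounded by `gamma` because the probe of Example 3 averages $|g|$ over the region where $|g| \ge \|g\|_\infty - \gamma$.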

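3. Main results. In this section we present the main results of this paper. We focus on the model selection aggregation setup where $\Theta = I_N = (1, \ldots, N)$, $\mathcal{F}_\Theta = \mathcal{F}_{I_N} = \{f_i, i \in I_N\}$. Let $\mathcal{G}_{I_N}$ and $\Psi_{I_N}$ be defined accordingly via (8) and (10). Note that $\mathcal{G}_{I_N}$ and $\Psi_{I_N}$ are finite sets of functions of cardinality $N(N-1)$. Following (4), for $\psi \in \Psi_{I_N}$ we write

(13) $\Delta_i(\psi) := \hat\ell_f(\psi) - \ell_{f_i}(\psi) = \int \psi(t)[f(t) - f_i(t)]\,dt + \varepsilon Z(\psi), \quad i \in I_N.$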






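For a fixed $\delta \in (0,1)$, $\varkappa = \varkappa(\delta, \Psi_{I_N})$ is given by (5); note that $\varkappa$ is well defined because $\Psi_{I_N}$ is a finite set. We write also

(14) $\hat M_i := \max_{\psi \in \Psi_{I_N}} \Bigl\{\frac{1}{\|\psi\|_q}\bigl[|\Delta_i(\psi)| - \varepsilon\varkappa\|\psi\|_2\bigr]\Bigr\}$

and

(15) $\hat i := \arg\min_{i \in I_N} \hat M_i, \quad \hat f = f_{\hat i}.$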

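3.1. Oracle inequalities. The next theorem establishes the basic oracle inequality on the $L_p$-risk of the estimator $\hat f$.

Theorem 1. Let $p \in [1, \infty]$, and assume that $\Psi_{I_N}$ is $(\gamma,p)$-good with respect to $\mathcal{G}_{I_N}$. Define $i_* := \arg\min_{i \in I_N} \|f - f_i\|_p$ and

(16) $\Psi^*_{I_N} := \{\psi \in \Psi_{I_N} \mid \psi = \psi_{f_{i_*} - f_i} = \psi_{i_*, i}, \ i \in I_N, \ i \ne i_*\}.$

Let $\delta \in (0,1)$ be fixed, and let $\varkappa = \varkappa(\delta, \Psi_{I_N})$ be defined in (5); then for $\hat f$ given in (14)-(15) one has

(17) $\mathcal{R}_p[\hat f; f] \le \Bigl(2 \max_{\psi \in \Psi^*_{I_N}} \|\psi\|_q + 1\Bigr) \min_{i \in I_N} \|f - f_i\|_p + 2\varkappa\varepsilon \max_{\psi \in \Psi^*_{I_N}} \|\psi\|_2 + \gamma + \Bigl[\|f\|_p + \max_{i \in I_N} \|f_i\|_p\Bigr]\delta.$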

Remark 1. The proof of Theorem 1 illuminates the role played by the assumption that $\Psi_{I_N}$ is $(\gamma,p)$-good. The key is the bound on the distance between selected and oracle estimators, $\|f_{i_*} - f_{\hat i}\|_p$. The fact that $\Psi_{I_N}$ is $(\gamma,p)$-good allows one to control this distance on an event of large probability in terms of the distance between corresponding linear functionals. The latter, in turn, is controlled by definition of the aggregation procedure.

We now apply the oracle inequality of Theorem 1 for the sets of probe 
functions discussed in Examples 1-3 of Section 2. Assume that 

(18) $\max\{\|f\|_p, \|f_1\|_p, \ldots, \|f_N\|_p\} =: L < \infty.$

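Corollary 1. Let $\Psi_{I_N} = \tilde\Psi_{I_N}$ where $\tilde\Psi_\Theta$ is defined in (11). Suppose that (18) holds; then for $\hat f$ given in (14)-(15) and associated with $\tilde\Psi_{I_N}$ and $\delta = \varepsilon$ one has

(19) $\mathcal{R}_p[\hat f; f] \le 3 \min_{i \in I_N} \|f - f_i\|_p + 2Q_1(p)\,\varepsilon\sqrt{2\ln\frac{N^2}{\varepsilon}} + 2L\varepsilon,$

where $Q_1(p) = 1$ for $1 \le p \le 2$, and

(20) $Q_1(p) = Q_1(\mathcal{F}_{I_N}, p) := \max_{i \ne i_*} \Bigl\{\frac{\|f_{i_*} - f_i\|_{2p-2}}{\|f_{i_*} - f_i\|_p}\Bigr\}^{p-1}, \quad 2 < p < \infty.$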
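The passage from (17) to (19) rests on a bound for $\varkappa$ when $\delta = \varepsilon$. The following worked derivation is a sketch of the standard union-bound argument, offered as an illustration rather than as the author's proof; it uses only the Gaussian tail bound $P(|\xi| \ge x) \le e^{-x^2/2}$ for a standard normal $\xi$ and $x \ge 0$, and the fact that the probe set has at most $N(N-1)$ elements.

```latex
% Each Z(psi)/||psi||_2 is standard normal, and the probe set has at most N(N-1) elements:
\[
P\Bigl[\max_{\psi \in \Psi_{I_N}} \frac{|Z(\psi)|}{\|\psi\|_2} \ge x\Bigr]
  \le N(N-1)\, e^{-x^2/2} \le N^2 e^{-x^2/2}, \qquad x \ge 0 .
\]
% The right-hand side equals delta at x = sqrt(2 ln(N^2/delta)); hence, by definition (5),
\[
\varkappa(\delta, \Psi_{I_N}) \le \sqrt{2 \ln (N^2/\delta)} .
\]
% For the probe set of Example 1: gamma = 0, max ||psi||_q = 1, max ||psi||_2 <= Q_1(p).
% Under (18), ||f||_p + max_i ||f_i||_p <= 2L. With delta = epsilon, (17) becomes
\[
\mathcal{R}_p[\hat f; f] \le 3 \min_{i \in I_N} \|f - f_i\|_p
  + 2 Q_1(p)\, \varepsilon \sqrt{2 \ln (N^2/\varepsilon)} + 2 L \varepsilon .
\]
```

For $1 \le p \le 2$ the bound $\max \|\psi\|_2 \le 1$ follows from $\|\psi\|_2 \le \|\psi\|_q = 1$, since $q \ge 2$ and the domain $[0,1]^d$ has unit volume.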



Remark 2. Our selection rule with $\Psi_{I_N} = \tilde\Psi_{I_N}$ and $p = 1$ reduces to the aggregation method by Devroye and Lugosi (1996, 1997, 2001). Indeed, when $p = 1$, the probe functions from the set $\tilde\Psi_{I_N}$ are given by $\psi_{ij} = \mathrm{sign}(f_i - f_j)$. In the density estimation context this corresponds to the Yatracos classes considered by Devroye and Lugosi. Note also that when $p \in [1,2]$ and $\Psi_{I_N} = \tilde\Psi_{I_N}$, the selection rule (14)-(15) could be modified as follows:

$\hat i = \arg\min_{i \in I_N} \max_{\psi \in \tilde\Psi_{I_N}} |\Delta_i(\psi)|.$

In this form our selection rule can be viewed as an implementation of the method by Devroye and Lugosi for the white noise model [see also Hengartner and Wegkamp (2001)]. For further discussion see Section 3.3.

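Corollary 2. Let $p \in [1, \infty]$, and $\Psi_{I_N} = \bar\Psi_{I_N}$ where $\bar\Psi_\Theta$ is defined in (12). Suppose that (18) holds; then for the estimate $\hat f$ given in (14)-(15) and associated with $\bar\Psi_{I_N}$ and $\delta = \varepsilon$ one has

(21) $\mathcal{R}_p[\hat f; f] \le (2Q_2(p) + 1) \min_{i \in I_N} \|f_i - f\|_p + 2Q_3(p)\,\varepsilon\sqrt{2\ln\frac{N^2}{\varepsilon}} + 2L\varepsilon,$

where

(22) $Q_2(p) = Q_2(\mathcal{F}_{I_N}, p) := \max_{i \ne i_*} \frac{\|f_{i_*} - f_i\|_p\,\|f_{i_*} - f_i\|_q}{\|f_{i_*} - f_i\|_2^2}, \qquad Q_3(p) = Q_3(\mathcal{F}_{I_N}, p) := \max_{i \ne i_*} \frac{\|f_{i_*} - f_i\|_p}{\|f_{i_*} - f_i\|_2}.$

In contrast to $\tilde\Psi_{I_N}$, the rule associated with $\bar\Psi_{I_N}$ allows one to treat the case $p = \infty$. Note, however, that it leads to the elevated factor preceding the best possible risk as compared to the selection rule that uses $\tilde\Psi_{I_N}$.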

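Corollary 3. Let (18) hold with $p = \infty$, and $\Psi_{I_N} = \check\Psi_{I_N}(\gamma_0)$ with $\gamma_0 = \varepsilon\sqrt{\ln N} < L$; then

(23) $\mathcal{R}_\infty[\hat f; f] \le 3 \min_{i \in I_N} \|f_i - f\|_\infty + 3Q_4(\gamma_0)\,\varepsilon\sqrt{2\ln\frac{N^2}{\varepsilon}} + 2L\varepsilon,$

where

(24) $Q_4(\gamma) = Q_4(\mathcal{F}_{I_N}, \gamma) := \max_{i \ne i_*} \frac{\|g_{i_*,i}(\cdot,\gamma)\|_2}{\|g_{i_*,i}(\cdot,\gamma)\|_1}, \qquad g_{i_*,i}(\cdot,\gamma) := \bigl[|f_{i_*}(\cdot) - f_i(\cdot)| - \|f_{i_*} - f_i\|_\infty + \gamma\bigr]_+.$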



The above results show that when $p \in [1,2]$ arbitrary estimators satisfying (18) can be efficiently aggregated in the following sense. Corollary 1 demonstrates that if $\Psi_{I_N} = \tilde\Psi_{I_N}$, then the resulting risk of the selected estimator is within factor 3 of the best possible risk whereas the remainder term is of the order $\varepsilon\sqrt{\ln(N^2/\varepsilon)}$. Thus one can aggregate a number $N$ of estimators that is polynomial in $\varepsilon^{-1}$ with remainder term of the order $\varepsilon\sqrt{\ln(1/\varepsilon)}$. Such a bound allows one to derive minimax and adaptive results in many nonparametric estimation setups.

The situation is completely different for $p \in (2, \infty]$. Here remainder terms in the oracle inequalities depend on the family of aggregated estimates through the values of $Q_1(p)$, $Q_3(p)$ and $Q_4(\gamma)$ that can be large for particular families $\mathcal{F}_{I_N}$.

3.2. Lower bound. The important question is whether the remainder terms in (19), (21) and (23) can be improved for families of arbitrary estimators $\mathcal{F}_{I_N}$ whenever $p > 2$. The next result shows that, in a sense, dependence of the remainder terms on the family $\mathcal{F}_{I_N}$ is unimprovable in the MS aggregation setup.

Theorem 2. Assume that $N \ge 3$ and $p \in (2, \infty]$; then there exists a family $\mathcal{F}_{I_N} = \{f_i, i \in I_N\}$ of functions on $\mathcal{D}_0$, satisfying $\max_{i \in I_N} \|f_i\|_p \le L$, such that for any selection rule $\hat f : \mathcal{Y}_\varepsilon \to \mathcal{F}_{I_N}$ and any $\varepsilon \le L(N \ln N)^{-1/2}$ one has

(25) $\max_{f \in \mathcal{F}_{I_N}} \Bigl\{\mathcal{R}_p[\hat f; f] - \min_{i \in I_N} \|f - f_i\|_p\Bigr\} \ge c K_p\,\varepsilon\sqrt{\ln(N-1)},$

where $K_p = Q_1(\mathcal{F}_{I_N}, p) = Q_3(\mathcal{F}_{I_N}, p)$, $\forall p \in [2, \infty)$, $K_\infty = Q_3(\mathcal{F}_{I_N}, \infty) = Q_4(\mathcal{F}_{I_N}, \gamma)$, $\forall \gamma > 0$, and $c$ is an absolute constant. The quantities $Q_1$, $Q_3$ and $Q_4$ are defined in (20), (22) and (24), respectively.



Remark 3. Because $\min_{i \in I_N} \|f - f_i\|_p = 0$ for $f \in \mathcal{F}_{I_N}$, (25) provides a lower bound on the remainder term in the $L_p$-risk oracle inequality. The worst-case family $\mathcal{F}_{I_N}$ in the proof of Theorem 2 is such that the $L_2$-norm of pairwise differences of its members is small in comparison with their $L_p$-norm. We note also that the worst-case family $\mathcal{F}_{I_N}$ does not depend on $p$.

Theorem 2 shows that the problem of aggregation of arbitrary estimators in $L_p$, $p \in (2, \infty]$, may be rather difficult. In particular, the proof of the theorem suggests that the $L_p$-risk of any aggregation procedure can be as large as $\varepsilon^{2/p}(\ln N)^{1/p}$, $p \in (2, \infty]$.

The meaning of the lower bound of Theorem 2 is that there is a family of estimators that cannot be aggregated with accuracy better than that in (25). This, however, does not imply that the same lower bound holds for a concrete family of reasonable estimators. It is known, for example, that kernel estimators can be efficiently aggregated in $L_p$, $p > 2$ [Goldenshluger and Lepski (2007)].



3.3. Modified aggregation procedure. In the definition of the aggregation procedure [see (14)], the "typical" value of the stochastic error, $\varepsilon\varkappa\|\psi\|_2$, is subtracted from $|\Delta_i(\psi)|$. Thus, this construction requires prior knowledge of the noise level $\varepsilon$. We note, however, that the original procedure can be modified in such a way that $\varepsilon$ need not be known.

Specifically, consider the following procedure: with $\Delta_i(\psi)$ given in (13) define

(26) $\tilde M_i := \max_{\psi \in \Psi_{I_N}} \Bigl\{\frac{1}{\|\psi\|_q}|\Delta_i(\psi)|\Bigr\}$

and let

(27) $\tilde i := \arg\min_{i \in I_N} \tilde M_i, \quad \tilde f = f_{\tilde i}.$

This construction does not require prior knowledge of the noise level $\varepsilon$. The next theorem establishes an oracle inequality for the estimator $\tilde f$.



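Theorem 3. Let conditions of Theorem 1 hold; then for the estimator $\tilde f$ defined in (26)-(27) one has

(28) $\mathcal{R}_p[\tilde f; f] \le \Bigl(2 \max_{\psi \in \Psi^*_{I_N}} \|\psi\|_q + 1\Bigr) \min_{i \in I_N} \|f - f_i\|_p + 2\varkappa\varepsilon \max_{\psi \in \Psi^*_{I_N}} \|\psi\|_q \max_{\psi \in \Psi^*_{I_N}} \frac{\|\psi\|_2}{\|\psi\|_q} + \gamma + \Bigl[\|f\|_p + \max_{i \in I_N} \|f_i\|_p\Bigr]\delta.$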



Remark 4. The second term on the right-hand side of (28) is greater than or equal to that on the right-hand side of (17). However, in special cases oracle inequality (28) is precise enough. For instance, if $p = 2$, then the remainder terms in (28) and (17) coincide. Note also that in the setup of Devroye and Lugosi (2001) ($p = 1$ and $\Psi_{I_N} = \tilde\Psi_{I_N}$; see Remark 2) we obtain

$2\varkappa\varepsilon \max_{\psi} \|\psi\|_\infty \max_{\psi} \frac{\|\psi\|_2}{\|\psi\|_\infty} \le 2\varkappa\varepsilon$

because $\|\psi\|_2 \le \|\psi\|_\infty = 1$ for every $\psi \in \tilde\Psi_{I_N}$ whenever $p = 1$. In these cases the use of the modified selection rule is advantageous as it does not require knowledge of the noise level $\varepsilon$.



3.4. $L_2$-risk oracle inequality. If $p = 2$, then the general oracle inequality of Theorem 1 can be improved. In particular, we demonstrate that in this specific case a mild modification of the original aggregation procedure leads to the exact oracle inequality with the leading constant equal to 1.

First we note that the sets of probe functions $\tilde\Psi_{I_N}$ and $\bar\Psi_{I_N}$ coincide when $p = 2$:

(29) $\psi_{ij}(\cdot) = \frac{f_i(\cdot) - f_j(\cdot)}{\|f_i - f_j\|_2}, \quad i, j \in I_N, \ i \ne j.$

Let $u_{ij} = \frac{1}{2}(f_i + f_j)$, and for all $i \in I_N$ define

(30) $\hat M_i := \max_{j \in I_N,\, j \ne i} \{\ell_{u_{ij}}(\psi_{ij}) - \hat\ell_f(\psi_{ij})\} = \max_{j \in I_N,\, j \ne i} \Bigl\{\int \psi_{ij}(t) u_{ij}(t)\,dt - \int \psi_{ij}(t)\, Y_\varepsilon(dt)\Bigr\}.$

The selection rule is defined by

(31) $\hat i = \arg\min_{i \in I_N} \hat M_i, \quad \hat f = f_{\hat i}.$

We remark that $\|\psi_{ij}\|_2 = 1$, $\forall i, j \in I_N$, $i \ne j$. A distinctive feature of the selection rule (29)-(31) is that for each pair $i, j \in I_N$ the empirical estimate of the linear functional $\ell_f(\psi_{ij})$ is compared with $\ell_{u_{ij}}(\psi_{ij})$ and not with $\ell_{f_i}(\psi_{ij})$ as in (13).

Theorem 4. Let $\hat f = f_{\hat i}$ be the estimator defined by (29)-(31); then

$\mathcal{R}_2[\hat f; f] \le \min_{i \in I_N} \|f_i - f\|_2 + 8\varepsilon\sqrt{2\ln N}.$

Thus the selection rule (29)-(31) achieves the optimal rates of the MS aggregation when the $L_2$-risk is considered [cf. Tsybakov (2003)].
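The selection rule (29)-(31) can be sketched in a few lines. The code below is an illustration under stated assumptions, not the author's implementation: the domain is $[0,1]$, functions are sampled on a uniform grid of `m` points, integrals are Riemann sums, and the stochastic integral against $Y_\varepsilon(dt)$ is replaced by a sum against increments `f(t_k)/m + eps * sqrt(1/m) * xi_k` with independent standard normal `xi_k`. The function name `select_l2` and the three candidates are hypothetical.

```python
import math
import random

def l2_norm(g):
    """Discretized L_2 norm on [0,1]."""
    return math.sqrt(sum(v * v for v in g) / len(g))

def select_l2(estimators, increments):
    """Selection rule (29)-(31) on a uniform grid; returns a 0-based index.

    estimators: list of candidate functions, each a list of grid values.
    increments: increments of the observed process over the grid cells
                (an assumed discretization of Y_eps(dt)).
    """
    n_est = len(estimators)
    m = len(increments)
    stats = []
    for i in range(n_est):
        worst = -math.inf
        for j in range(n_est):
            if j == i:
                continue
            diff = [a - b for a, b in zip(estimators[i], estimators[j])]
            norm = l2_norm(diff)
            if norm == 0.0:
                continue  # coinciding candidates define no probe function
            # psi_ij = (f_i - f_j) / ||f_i - f_j||_2 and u_ij = (f_i + f_j) / 2
            ell_u = sum(d * (a + b) / 2 for d, a, b in
                        zip(diff, estimators[i], estimators[j])) / (m * norm)
            ell_hat = sum(d * y for d, y in zip(diff, increments)) / norm
            worst = max(worst, ell_u - ell_hat)
        stats.append(worst)
    return min(range(n_est), key=stats.__getitem__)

# Hypothetical example: f(t) = t and three fixed candidates.
rng = random.Random(1)
m, eps = 200, 0.01
grid = [(k + 0.5) / m for k in range(m)]
f = [t for t in grid]
candidates = [[0.0] * m, [t + 0.1 for t in grid], [1.0] * m]
increments = [v / m + eps * math.sqrt(1.0 / m) * rng.gauss(0, 1) for v in f]
chosen = select_l2(candidates, increments)
```

Without noise, the statistic for candidate $i$ equals $\max_{j \ne i} (\|f_i - f\|_2^2 - \|f_j - f\|_2^2) / (2\|f_i - f_j\|_2)$, so it is negative only for the candidate closest to $f$ in $L_2$; here that is the second candidate, whose distance to $f$ is 0.1.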
4. Miscellaneous extensions and numerical results. The objective of this
section is to demonstrate that the proposed procedure can be applied for 
different models and global risk measures. First we discuss the problem of 
convex aggregation, and then we show how the aggregation scheme can be 
applied for estimation in the normal means model. We also provide some 
numerical results for the problem of estimating a normal mean vector. 

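4.1. Convex aggregation. The problem of convex aggregation is formulated as follows: given a set of estimators $f_i$, $i \in I_N$, the objective is to select an estimator, say $\hat F = F_{\hat\lambda}$, from the collection

$\mathcal{F}_\Lambda = \Bigl\{F_\lambda \Bigm| F_\lambda(t) = \sum_{i=1}^N \lambda_i f_i(t), \ \lambda \in \Lambda\Bigr\},$

such that $\hat F$ is nearly as good as the best estimator from $\mathcal{F}_\Lambda$. Here $\Lambda$ is the $N$-dimensional simplex; see (2).

For $\eta > 0$ let $\Lambda_\eta = (\lambda^{(k)}, k = 1, \ldots, n_\eta)$ denote the minimal $\eta$-net of $\Lambda$ in $\ell_1$-norm; that is, for any $\lambda \in \Lambda$ there exists $\lambda^{(k)} \in \Lambda_\eta$ such that

$|\lambda - \lambda^{(k)}|_1 = \sum_{i=1}^N |\lambda_i - \lambda_i^{(k)}| \le \eta.$

Let $\mathcal{G}_\Lambda = \{g \mid g = F_\lambda - F_\nu, \ \lambda, \nu \in \Lambda, \ \nu \ne \lambda\}$, and let $\mathcal{G}_{\Lambda_\eta}$ be defined similarly with $\Lambda$ replaced by $\Lambda_\eta$ [cf. (8)]. Note that $\mathcal{G}_{\Lambda_\eta}$ is a finite set with $\mathrm{card}(\mathcal{G}_{\Lambda_\eta}) = n_\eta(n_\eta - 1)$.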
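The text works with a minimal $\eta$-net and does not construct it. As an illustration only, one simple (generally non-minimal) $\eta$-net is the set of points of $\Lambda$ whose coordinates are multiples of $h = \eta/N$: rounding every coordinate of $\lambda \in \Lambda$ down to this grid keeps the point in $\Lambda$ and moves it by less than $\eta$ in $\ell_1$-norm. The function name `round_to_grid` and the sample point are hypothetical.

```python
import math

def round_to_grid(lam, eta):
    """Round each coordinate of lam down to a multiple of h = eta / N."""
    n = len(lam)
    h = eta / n
    return [math.floor(v / h) * h for v in lam]

lam = [0.30, 0.45, 0.20]   # a point of Lambda: nonnegative entries, sum <= 1
eta = 0.1
lam_net = round_to_grid(lam, eta)
# Each coordinate moves by less than h, so the total l1 displacement is below eta.
l1_error = sum(abs(a - b) for a, b in zip(lam, lam_net))
```

Rounding down, rather than to the nearest grid point, is what guarantees that the sum of the coordinates does not exceed 1 after rounding.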

We begin with a lemma showing that if (18) holds, then any $(0,p)$-good set with respect to $\mathcal{G}_{\Lambda_\eta}$ is also $(\gamma,p)$-good with respect to $\mathcal{G}_\Lambda$ with some $\gamma = \gamma(\eta) > 0$.

Lemma 1. Assume that (18) holds, and let $\Psi$ be the $(0,p)$-good set with respect to $\mathcal{G}_{\Lambda_\eta}$. Then $\Psi$ is $(\gamma,p)$-good with respect to $\mathcal{G}_\Lambda$ with

(32) $\gamma = 2L\eta\Bigl(1 + \max_{\psi \in \Psi} \|\psi\|_q\Bigr).$

Lemma 1 allows one to reduce the problem of convex aggregation to the MS aggregation over a finite family of estimators. The idea is to apply the selection procedure of Section 2 to the finite set of estimators induced by the minimal $\eta$-net $\Lambda_\eta$ in $\Lambda$.

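Similarly to (13), for $\psi \in \Psi$ we write

$\Delta_\lambda(\psi) = \hat\ell_f(\psi) - \ell_{F_\lambda}(\psi) = \int \psi(t)[f(t) - F_\lambda(t)]\,dt + \varepsilon \int \psi(t)\, W(dt), \quad \lambda \in \Lambda.$

Let $\eta = \varepsilon$, and $\Lambda_\varepsilon = \{\lambda^{(k)}, k = 1, \ldots, n_\varepsilon\}$ be a minimal $\varepsilon$-net in $\ell_1$-norm for $\Lambda$. Let $\Psi_{\Lambda_\varepsilon}$ be a $(0,p)$-good set w.r.t. $\mathcal{G}_{\Lambda_\varepsilon}$. For $\delta \in (0,1)$ let $\varkappa = \varkappa(\delta, \Psi_{\Lambda_\varepsilon})$ be given by (5). Define

(33) $\hat\lambda := \arg\min_{\lambda \in \Lambda_\varepsilon} \max_{\psi \in \Psi_{\Lambda_\varepsilon}} \Bigl\{\frac{1}{\|\psi\|_q}\bigl[|\Delta_\lambda(\psi)| - \varepsilon\varkappa\|\psi\|_2\bigr]\Bigr\}, \quad \hat F := F_{\hat\lambda}.$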



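Theorem 5. Assume that $\Psi_{\Lambda_\varepsilon}$ is $(0,p)$-good with respect to $\mathcal{G}_{\Lambda_\varepsilon}$. Then for $\varkappa = \varkappa(\delta, \Psi_{\Lambda_\varepsilon})$ defined in (5) and $\hat F$ given by (33) one has

$\mathcal{R}_p[\hat F; f] \le \Bigl(2 \max_{\psi \in \Psi_{\Lambda_\varepsilon}} \|\psi\|_q + 1\Bigr) \min_{\lambda \in \Lambda} \|f - F_\lambda\|_p + 2\varkappa\varepsilon \max_{\psi \in \Psi_{\Lambda_\varepsilon}} \|\psi\|_2 + 2L\varepsilon\Bigl(1 + \max_{\psi \in \Psi_{\Lambda_\varepsilon}} \|\psi\|_q\Bigr) + 2L\delta.$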

The oracle inequality of Theorem 5 can be straightforwardly specialized 
for specific sets of probe functions. We provide here only one result corre- 
sponding to Example 1 in Section 2. 

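Corollary 4. Let $\Psi_{\Lambda_\varepsilon} = \tilde\Psi_{\Lambda_\varepsilon}$ where $\tilde\Psi_\Theta$ is defined in (11). Then for the estimator $\hat F$ associated with $\delta = \varepsilon$ one has

$\mathcal{R}_p[\hat F; f] \le 3 \min_{\lambda \in \Lambda} \|f - F_\lambda\|_p + cQ_1(p)\,\varepsilon\sqrt{N \ln\frac{1}{\varepsilon}} + 6L\varepsilon,$

where $c$ is an absolute constant, and

$Q_1(p) := 1$ for $1 \le p \le 2$, and $Q_1(p) := \max_{\lambda, \nu \in \Lambda_\varepsilon} \Bigl\{\frac{\|F_\lambda - F_\nu\|_{2p-2}}{\|F_\lambda - F_\nu\|_p}\Bigr\}^{p-1}$ for $2 < p < \infty$.

The proof is identical to that of Corollary 1; it suffices to note only that $n_\varepsilon = \mathrm{card}(\Lambda_\varepsilon) = (c'\varepsilon^{-1})^N$, where $c'$ is an absolute constant.

It is well known [Tsybakov (2003)] that in the problem of convex aggregation with $p = 2$ and $N \le \varepsilon^{-1}$ the optimal (in a minimax sense) order of the remainder term is $\varepsilon\sqrt{N}$. In this particular case, our aggregation procedure achieves the indicated bound within a factor logarithmic in $\varepsilon^{-1}$.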

4.2. Normal means model. Consider the normal means model

(34) $Y = \mu + \varepsilon w, \quad \mu \in \mathbb{R}^n, \quad w \sim \mathcal{N}_n(0, \Sigma),$

where $\mu$ is an unknown vector and $\Sigma$ is the noise correlation matrix. We want to estimate $\mu$ using the observation $Y$. The model (34) is a prototype of many different nonparametric models [see, e.g., Johnstone (1998)].

Suppose that we are given a family $\{\mu_i, i \in I_N = (1, \ldots, N)\}$ of candidate estimators of $\mu$. As before, we regard the estimators $\mu_i$, $i \in I_N$, as fixed deterministic vectors. The risk of an estimator $\hat\mu$ is given by $E_\mu|\hat\mu - \mu|_p$, where $|\cdot|_p$, $p \in [1, \infty]$, stands for the standard $p$-norm in $\mathbb{R}^n$. The objective is to select a single estimator from the family whose risk is as close as possible to that of the best estimator in the family.

The general aggregation scheme of Section 2 can be easily adapted for the outlined setup. Let $\Psi$ denote a set of probe vectors from $\mathbb{R}^n$. For $\psi \in \Psi$ define the linear functional $\ell_\mu(\psi) = \psi^T\mu$ and for every $\psi \in \Psi$ consider the following estimators of $\ell_\mu(\psi)$:

$\hat\ell_\mu(\psi) = \psi^T Y, \qquad \ell_{\mu_i}(\psi) = \psi^T\mu_i, \quad i \in I_N.$

Define $\Delta_i(\psi) = \hat\ell_\mu(\psi) - \ell_{\mu_i}(\psi)$ and note that $\Delta_i(\psi) = \psi^T(\mu - \mu_i) + \varepsilon Z(\psi)$ where $Z(\psi) = \psi^T w$ is a zero-mean normal random variable with variance $|\psi|_\Sigma^2 := \psi^T\Sigma\psi$.

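The aggregation procedure is defined as follows. Let $\delta \in (0,1)$, and let

(35) $\varkappa = \varkappa(\delta, \Psi) := \min\Bigl\{x > 0 \Bigm| P\Bigl[\max_{\psi \in \Psi} \frac{|Z(\psi)|}{|\psi|_\Sigma} \ge x\Bigr] \le \delta\Bigr\}.$

Let, as before, $q$ and $p$ be the conjugate exponents, and define

(36) $\hat M_i := \max_{\psi \in \Psi} \Bigl\{\frac{1}{|\psi|_q}\bigl(|\Delta_i(\psi)| - \varkappa\varepsilon|\psi|_\Sigma\bigr)\Bigr\},$

(37) $\hat i := \arg\min_{i \in I_N} \hat M_i, \quad \hat\mu := \mu_{\hat i}.$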

According to Section 2, the set of probe vectors $\Psi$ should have some "good" norm approximation properties. In the context of the normal means model this requirement is formulated as follows.

Definition 2. Let

$\mathcal{G} := \{g \in \mathbb{R}^n : g = \mu_i - \mu_j, \ i \ne j, \ i, j \in I_N\},$

and let $\gamma \ge 0$. We say that the set of vectors $\Psi$ from $\mathbb{R}^n$ is $(\gamma,p)$-good if for every vector $g \in \mathcal{G}$ there is a vector $\psi_g \in \Psi$ such that

$|\psi_g^T g - |g|_p| \le \gamma.$

As before we will use $(\gamma,p)$-good sets in the form

$\Psi = \{\psi_g = \psi_{ij} := \psi_{\mu_i - \mu_j}, \ i \ne j, \ i, j \in I_N\},$

where $\psi_{ij}$ is a vector such that

$|\psi_{ij}^T(\mu_i - \mu_j) - |\mu_i - \mu_j|_p| \le \gamma.$

Now we are in a position to establish an oracle inequality for the aggregation rule (36)-(37).

Theorem 6. Let $p \in [1, \infty]$, $\Psi$ be a $(\gamma,p)$-good set, $\delta \in (0,1)$, and let $\varkappa$ be defined in (35). Assume that

$\max\{|\mu|_p, |\mu_1|_p, \ldots, |\mu_N|_p\} =: L < \infty.$

Define $i_* = \arg\min_{i \in I_N} |\mu_i - \mu|_p$ and

$\Psi^* := \{\psi \in \Psi \mid \psi = \psi_{i_*, j}, \ j \ne i_*, \ j \in I_N\}.$

Then for $\hat\mu$ given by (36)-(37) one has

(38) $E_\mu|\hat\mu - \mu|_p \le \Bigl(2 \max_{\psi \in \Psi^*} |\psi|_q + 1\Bigr) \min_{i \in I_N} |\mu_i - \mu|_p + 2\varkappa\varepsilon \max_{\psi \in \Psi^*} |\psi|_\Sigma + \gamma + 2L\delta.$

The proof of Theorem 6 is identical to that of Theorem 1, and it is omitted.

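The oracle inequality of Theorem 6 is easily specialized for specific sets of $(\gamma,p)$-good probe vectors. For example, let $p \in [1, \infty)$ and define $\psi_{ij} \in \mathbb{R}^n$ by

$\psi_{ij}(k) = \frac{|\mu_i(k) - \mu_j(k)|^{p-1}\,\mathrm{sign}\{\mu_i(k) - \mu_j(k)\}}{|\mu_i - \mu_j|_p^{p-1}}, \quad k = 1, \ldots, n,$

where $a(k)$, $k = 1, \ldots, n$, denotes the $k$th component of a generic vector $a \in \mathbb{R}^n$. Then the set of probe vectors $\tilde\Psi := \{\psi_{ij}, \ i \ne j, \ i, j \in I_N\}$ is $(0,p)$-good. Note also that $\tilde\Psi \subset \{\psi : |\psi|_q = 1\}$.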
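The rule (36)-(37) with the probe vectors just defined can be sketched as follows. This is a minimal illustration under explicit assumptions, not the author's code: $\Sigma$ is taken to be the identity matrix, so that $|\psi|_\Sigma = |\psi|_2$, and the quantile $\varkappa$ of (35) is replaced by the union-bound value $\sqrt{2\ln(N^2/\delta)}$, which is an upper bound on it rather than its exact value. The names `select`, `probe` and `p_norm`, and the three candidate vectors, are hypothetical.

```python
import math

def p_norm(v, p):
    """Standard p-norm of a vector, with p = math.inf allowed."""
    if p == math.inf:
        return max(abs(x) for x in v)
    return sum(abs(x) ** p for x in v) ** (1.0 / p)

def sign(x):
    return (x > 0) - (x < 0)

def probe(g, p):
    """(0,p)-good probe vector for a nonzero g, valid for 1 <= p < infinity."""
    scale = p_norm(g, p) ** (p - 1)
    return [abs(x) ** (p - 1) * sign(x) / scale for x in g]

def select(estimators, y, eps, p, delta):
    """Rule (36)-(37) with Sigma = identity; returns the selected 0-based index."""
    n_est = len(estimators)
    q = math.inf if p == 1 else p / (p - 1)
    # Assumption: kappa is replaced by the union-bound value sqrt(2 ln(N^2/delta)),
    # an upper bound on the quantile defined in (35).
    kappa = math.sqrt(2 * math.log(n_est ** 2 / delta))
    probes = []
    for i in range(n_est):
        for j in range(n_est):
            g = [a - b for a, b in zip(estimators[i], estimators[j])]
            if i != j and any(g):
                probes.append(probe(g, p))
    stats = []
    for mu_i in estimators:
        resid = [a - b for a, b in zip(y, mu_i)]   # Delta_i(psi) = psi^T (Y - mu_i)
        stats.append(max(
            (abs(sum(s * r for s, r in zip(psi, resid)))
             - kappa * eps * p_norm(psi, 2)) / p_norm(psi, q)
            for psi in probes
        ))
    return min(range(n_est), key=stats.__getitem__)

# Hypothetical candidates and observation in R^4.
estimators = [[0.0, 0.0, 0.0, 0.0], [2.0, 0.0, 0.0, 0.0], [2.0, 2.0, 0.0, 0.0]]
y = [2.1, -0.1, 0.05, 0.0]
chosen = select(estimators, y, eps=0.1, p=2, delta=0.1)
```

In this example the second candidate is the only one whose residual `y - mu_i` has small inner products with all probe vectors, so it is the one selected. The candidates are assumed to be pairwise distinct, as in the MS aggregation setup; otherwise the probe set may be empty.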

The next result is an immediate consequence of Theorem 6. 

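Corollary 5. Let $p \in [1, \infty)$, $\Psi = \tilde\Psi$, and assume that $\Sigma$ is the identity matrix. Let $\delta = \varepsilon$; then

$E_\mu|\hat\mu - \mu|_p \le 3 \min_{i \in I_N} |\mu_i - \mu|_p + 2Q(p)\,\varepsilon\sqrt{2\ln\frac{N^2}{\varepsilon}} + 2L\varepsilon,$

where

$Q(p) := 1$ for $2 \le p < \infty$; $\quad Q(p) := \max_{i \ne i_*} \Bigl\{\frac{|\mu_{i_*} - \mu_i|_{2p-2}}{|\mu_{i_*} - \mu_i|_p}\Bigr\}^{p-1}$ for $1 < p < 2$; $\quad Q(p) := \max_{i \ne i_*} [\mathrm{card}\{k : \mu_{i_*}(k) \ne \mu_i(k)\}]^{1/2}$ for $p = 1$.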

Corollary 5 shows that if $p \in [2, \infty)$, then the risk of the selected estimator is within factor 3 of the best possible risk whereas the remainder term is of the order $\varepsilon\sqrt{\ln(N^2/\varepsilon)}$. If $p \in [1,2)$, then the remainder terms in the oracle inequalities depend on the family of aggregated estimators. The situation here is opposite to that discussed in Section 3 because of reciprocal behavior (with respect to $p$) of $L_p$-norms on $[0,1]^d$ and $p$-norms in $\mathbb{R}^n$.

The aggregation procedure (36)-(37) requires prior knowledge of the noise level $\varepsilon$ and the noise covariance matrix $\Sigma$. However, (36)-(37) can be modified in the spirit of Section 3.3. Specifically, let

(39) $\tilde M_i := \max_{\psi \in \Psi} \Bigl\{\frac{1}{|\psi|_q}|\Delta_i(\psi)|\Bigr\},$

(40) $\tilde i := \arg\min_{i \in I_N} \tilde M_i, \quad \tilde\mu := \mu_{\tilde i}.$

The next result establishes an upper bound on the accuracy of $\tilde\mu$.

Theorem 7. Let conditions of Theorem 6 hold. Then for the estimator $\tilde\mu$ one has

(41) $E_\mu|\tilde\mu - \mu|_p \le \Bigl(2 \max_{\psi \in \Psi^*} |\psi|_q + 1\Bigr) \min_{i \in I_N} |\mu_i - \mu|_p + 2\varkappa\varepsilon \max_{\psi \in \Psi^*} |\psi|_q \max_{\psi \in \Psi^*} \frac{|\psi|_\Sigma}{|\psi|_q} + \gamma + 2L\delta.$

The proof is identical to that of Theorem 3 and it is omitted.

Even though the right-hand side of (41) is greater than or equal to the right-hand side of (38), $\tilde\mu$ can be advantageous in comparison with $\hat\mu$. For instance, if $p = 2$, and if the ratio of the norms $|\cdot|_\Sigma$ and $|\cdot|_2$ does not depend on $N$, then the second terms on the right-hand sides of (41) and (38) are of the same order. In this case it is advantageous to use the estimator $\tilde\mu$ because it does not require knowledge of $\varepsilon$ and $\Sigma$.

4.3. Some numerical results. A small simulation study was carried out in 
order to illustrate usefulness and practical potential of the proposed scheme. 
We investigate performance of our procedure for estimating a normal mean 
vector under the following two different scenarios: 

(i) the vector has K randomly located nonzero coefficients; 

(ii) the vector has K first nonzero components. 

Under the first scenario thresholding estimators with properly chosen thresh- 
old will presumably perform well. In this context our selection rule pro- 
vides an estimator that adapts to unknown sparsity. Recently the topic 
of adaptive estimation of sparse vectors has attracted much attention in 
the literature; we refer to Abramovich et al. (2006), Golubev (2002) and 
Johnstone and Silverman (2004) where further references can be found. In the second scenario projection estimators are appropriate. As we will see below, our estimator mimics the best estimator closely in both cases.

Table 1
The $L_2$-risk averaged over 100 replications in estimating (i) a normal mean vector with $K$ randomly located nonzero coefficients; (ii) a normal mean vector with $K$ first nonzero coefficients

                                     Best projection   Best thresholding
        K    Oracle   Aggregation          estimator           estimator   $\hat K$
(i)     5     2.498         2.726              4.593               2.499       5.26
       50     6.446         6.557             13.994               6.446      50.08
      250    11.388        11.559             19.949              11.388     292
      500    13.649        14.378             24.471              13.649     613.03
(ii)   10     1.551         2.340              1.556               2.582      11.29
       50     3.546         3.916              3.546               5.589      44.89
      250     8.608         8.955              8.608              11.337     283.69
      500    11.200        11.230             11.200              14.566     497.33

Conditions of our numerical experiments are as follows. We consider the 
normal means model (34) with n = 1000 and S being the identity matrix. 
In the first scenario the components of the unknown vector $\mu$ are assumed to be
zero except at K = 5, 50, 250, 500 randomly chosen locations, where they take the
specified value m = 2. In the second scenario the first K = 10, 50, 250, 500
components of the unknown vector $\mu$ are nonzero and are generated as independent
standard normal random variables. In both scenarios the results are averaged 
over 100 replications for each value of K. 

In our experiments we use two samples (random vectors) $Y_1$ and $Y_2$: the
first one, $Y_1 \sim \mathcal{N}_{1000}(\mu, \varepsilon_1^2 I)$, $\varepsilon_1 = 0.5$, is used for construction of estimators,
while the second one, $Y_2 \sim \mathcal{N}_{1000}(\mu, \varepsilon_2^2 I)$, $\varepsilon_2 = 1$, is used for aggregation.
The collection contains 20 estimators $\hat\mu_1, \ldots, \hat\mu_{20}$:

• 10 projection estimators $\hat\mu_i$, i = 1, . . . , 10,

$\hat\mu_i(k) = Y_1(k)\,1(k \le \mathrm{ord}_i)$, k = 1, . . . , 1000,

with ord = (5, 10, 20, 50, 100, 200, 300, 500, 700, 800).

• 10 thresholding estimators $\hat\mu_i$, i = 11, . . . , 20,

$\hat\mu_i(k) = Y_1(k)\,1\{|Y_1(k)| > \varepsilon_1\sqrt{2\ln(n/t_{i-10})}\}$, k = 1, . . . , 1000,

where $t = (1, n^{1/4}, n^{1/2}, n^{3/4}, n^{5/6}, n^{7/8}, n^{9/10}, n^{15/16}, n^{31/32}, n^{63/64})$.

The estimators are aggregated on the basis of the second sample $Y_2$ using
the modified procedure (39)-(40) with p = 2.
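
The construction of the collection can be sketched in a few lines of code. The following is a minimal illustration in plain Python, not the author's implementation; the function names `build_estimators` and `sparse_sample`, the seed, and the reading of the projection indicator as "k at most ord" are assumptions made here. It builds the 20 estimators from a simulated first sample under scenario (i) and counts the nonzero coefficients of each.

```python
import math
import random

ORDERS = [5, 10, 20, 50, 100, 200, 300, 500, 700, 800]
T_EXPONENTS = [0, 1/4, 1/2, 3/4, 5/6, 7/8, 9/10, 15/16, 31/32, 63/64]

def build_estimators(y1, eps1=0.5):
    """Return 20 estimates built from y1: 10 projections, then 10 thresholdings."""
    n = len(y1)
    estimators = []
    for order in ORDERS:
        # Keep coordinates k = 1..order (0-based index < order), zero the rest.
        estimators.append([y if k < order else 0.0 for k, y in enumerate(y1)])
    for a in T_EXPONENTS:
        # Threshold eps1 * sqrt(2 ln(n / t)) with t = n**a, i.e. n / t = n**(1 - a).
        threshold = eps1 * math.sqrt(2.0 * math.log(n ** (1.0 - a)))
        estimators.append([y if abs(y) > threshold else 0.0 for y in y1])
    return estimators

def sparse_sample(n=1000, K=5, value=2.0, eps1=0.5, seed=0):
    """Scenario (i): K randomly located coordinates equal to `value`, plus noise."""
    rng = random.Random(seed)
    mu = [0.0] * n
    for k in rng.sample(range(n), K):
        mu[k] = value
    y1 = [m + eps1 * rng.gauss(0.0, 1.0) for m in mu]
    return mu, y1

mu, y1 = sparse_sample()
estimators = build_estimators(y1)
# Number of nonzero coefficients of each of the 20 estimators.
nonzeros = [sum(1 for v in est if v != 0.0) for est in estimators]
print(nonzeros)
```

Because the thresholds decrease as t grows, the thresholding estimators are nested: each one keeps at least the coordinates kept by its predecessor.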

Table 1 reports the average L2-risk of the proposed aggregation procedure
(Aggregation), and the average L2-risks of three oracles that know
the vector to be estimated and select: (a) the best estimator (Oracle) in the
collection; (b) the best projection estimator in the collection; and (c) the 
best thresholding estimator in the collection. The last column K displays 
the average number of nonzero coefficients in the selected estimate. Part (i) 
of the table presents results for the first scenario while part (ii) corresponds 
to the second scenario. 
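
The three oracle columns can be approximated by direct simulation. The sketch below is an illustration under stated assumptions, not the code behind Table 1: it uses plain Python, 20 replications instead of 100, an arbitrary seed, and hypothetical helper names (`losses_one_replication`, `oracle_risks`). It covers scenario (ii) only and does not implement the aggregation procedure (39)-(40).

```python
import math
import random

ORDERS = [5, 10, 20, 50, 100, 200, 300, 500, 700, 800]
T_EXPONENTS = [0, 1/4, 1/2, 3/4, 5/6, 7/8, 9/10, 15/16, 31/32, 63/64]

def losses_one_replication(K, rng, n=1000, eps1=0.5):
    """L2-losses of the 10 projection and 10 thresholding estimators, scenario (ii)."""
    mu = [rng.gauss(0.0, 1.0) if k < K else 0.0 for k in range(n)]
    y1 = [m + eps1 * rng.gauss(0.0, 1.0) for m in mu]
    sq_err = [(y - m) ** 2 for y, m in zip(y1, mu)]   # error if the coordinate is kept
    sq_mu = [m ** 2 for m in mu]                      # error if the coordinate is zeroed
    proj = [math.sqrt(sum(sq_err[:order]) + sum(sq_mu[order:])) for order in ORDERS]
    thr = []
    for a in T_EXPONENTS:
        cut = eps1 * math.sqrt(2.0 * math.log(n ** (1.0 - a)))
        thr.append(math.sqrt(sum(e if abs(y) > cut else s
                                 for y, e, s in zip(y1, sq_err, sq_mu))))
    return proj, thr

def oracle_risks(K, replications=20, seed=1):
    """Average loss of the best estimator, best projection, best thresholding."""
    rng = random.Random(seed)
    totals = [0.0, 0.0, 0.0]
    for _ in range(replications):
        proj, thr = losses_one_replication(K, rng)
        totals[0] += min(proj + thr)
        totals[1] += min(proj)
        totals[2] += min(thr)
    return [t / replications for t in totals]

oracle, best_proj, best_thr = oracle_risks(K=500)
print(round(oracle, 3), round(best_proj, 3), round(best_thr, 3))
```

For K = 500 the projection estimator of order 500 keeps exactly the nonzero coordinates, so its loss should be close to $\varepsilon_1\sqrt{500} \approx 11.18$, which is in line with the corresponding entry of part (ii) of the table.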

The results indicate that in estimating sparse vectors [part (i) of the table] 
in almost all replications thresholding estimators outperform the projection 
estimators. The situation is opposite for vectors with nonzero first coeffi- 
cients [part (ii) of the table]: here projection estimators perform better. In 
both cases our aggregation procedure follows closely the best estimator from 
the collection for all values of K. The results in the last column also suggest 
that the aggregation procedure recovers a sparsity pattern of the estimated 
vector. 

Additional insight into performance of the aggregation procedure is gained 
from Figures 1 and 2. These figures show typical behavior of the procedure 
under scenarios (i) and (ii). The rows (a)-(d) of the diagrams in Figures 1 
and 2 correspond to different values of the parameter K. In each replication 
the competing estimators $\hat\mu_i$, i = 1, . . . , 20, were ranked according to their
performance measured by the L2-risk. The barplots in the left column of the
figures display the number of replications out of 100 in which the aggregation
procedure selects the estimator with rank 1, 2, . . . , 20. The diagrams in the
middle column of Figures 1 and 2 show how many times each estimator
$\hat\mu_i$ was selected. The right column displays the L2-risk of all estimators
averaged over 100 replications.

It is seen from the barplots in the left column of Figure 1 that in the 
cases K = 5, 50, 250 the procedure selects the best estimator in more than 
65% of replications. In particular, for K = 5 the middle panel in the row (a) 
demonstrates that most of the time the procedure selects the estimators 
$\hat\mu_{11}$ and $\hat\mu_{12}$ (the thresholding estimators with $t = 1$ and $t = n^{1/4}$, resp.).
The corresponding barplot in the right column shows that the average L2-risks
of these two estimators are significantly smaller than those of the other
estimators. Similar patterns are observed when K equals 50 and 250 [the
rows (b) and (c) of Figure 1]. On the other hand, in the case K = 500 inferior
estimators are chosen more frequently. Here the procedure selects one of the
seven thresholding estimators with $t \ge n^{3/4}$. As the right panel in the row
(d) indicates, the average L2-risks of these estimators are quite close. This 
fact explains the shape of the barplot in the corresponding left panel. 

Similar conclusions can be drawn from the barplots of Figure 2. In the case 
K = 10, according to the middle panel in the row (a), the procedure selects 
either the projection estimators with ord = 5, 10, 20, 50, or the thresholding 
estimators with $t = 1, n^{1/4}$. The right panel in the row (a) shows that the
average risks of these estimators are quite close. On the other hand, when
K = 500 [the row (d) of Figure 2], the projection estimator of the order
ord = 500 is selected in all replications, and its average risk is significantly
smaller than the risks of all other estimators.

Fig. 1. Scenario (i). Left column: the number of replications out of 100 where the procedure
selects the estimator with rank 1, 2, . . . , 20. Middle column: the number of selections
versus the estimator index. Right column: the average L2-risk versus the estimator index.
Sparsity parameter K: (a) K = 5; (b) K = 50; (c) K = 250; (d) K = 500.

Fig. 2. Scenario (ii). Left column: the number of replications out of 100 where the procedure
selects the estimator with rank 1, 2, . . . , 20. Middle column: the number of selections
versus the estimator index. Right column: the average L2-risk versus the estimator index.
The parameter K: (a) K = 10; (b) K = 50; (c) K = 250; (d) K = 500.

Summing up, the shapes of the diagrams in Figures 1 and 2 and our
numerical experience suggest that performance of the procedure is essentially
determined by the risks of the estimators to be aggregated and by
the noise level $\varepsilon_2$ at the aggregation stage. The procedure succeeds in
detecting the best estimator in a majority of replications when its performance
is "significantly" better than the performance of the other estimators in the
collection. Significance here is relative to the noise level $\varepsilon_2$ at the aggregation
stage. On the other hand, if there is a large number of good estimators
that perform almost equally well, the procedure makes more errors in the
estimator selection. However, this does not lead to a significant increase in
the risk. Our numerical experience also shows that the behavior of the proposed
aggregation procedure is quite reasonable for the L1-losses as well.



5. Proofs. 



5.1. Proofs of Theorem 1 and Corollary 1. 



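Proof of Theorem 1. (1) We begin with the following simple observation.
Let

(42) $A_\varkappa := \Big\{\omega : \max_{\psi \in \Psi_{I_N}} \frac{|Z(\psi)|}{\|\psi\|_2} \le \varkappa\Big\}$,

where $\varkappa = \varkappa(\delta, \Psi_{I_N})$ is defined in (5). It follows from (13) and the definition of
$A_\varkappa$ that for any $\psi \in \Psi_{I_N}$ and $i \in I_N$

(43) $|\Delta_i(\psi)|\,1(A_\varkappa) \le \Big|\int \psi(t)[f(t) - f_i(t)]\,dt\Big| + \varepsilon\varkappa\|\psi\|_2$.

Therefore

(44) $M_i\,1(A_\varkappa) = \max_{\psi \in \Psi_{I_N}} \frac{1}{\|\psi\|_q}\big[|\Delta_i(\psi)| - \varepsilon\varkappa\|\psi\|_2\big]\,1(A_\varkappa) \le \|f - f_i\|_p \quad \forall i \in I_N$.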

(2) Write

$\|\hat f - f\|_p = \|\hat f - f\|_p\,1(A_\varkappa) + \|\hat f - f\|_p\,1(A_\varkappa^c)$.

By definition $P(A_\varkappa) \ge 1 - \delta$. Let $i_* = \arg\min_{i \in I_N} \|f - f_i\|_p$ and $f_* = f_{i_*}$; then

(45) $\|\hat f - f\|_p\,1(A_\varkappa) \le \|f_* - f\|_p\,1(A_\varkappa) + \|f_{i_*} - f_{\hat i}\|_p\,1(A_\varkappa)$.

Our current goal is to bound the second term on the right-hand side of (45).
First we note that

(46) $\Delta_i(\psi) - \Delta_j(\psi) = \int \psi(t)[f_j(t) - f_i(t)]\,dt \quad \forall i, j \in I_N,\ \psi \in \Psi_{I_N}$.

By the premise of the theorem $\Psi_{I_N}$ is $(\gamma, p)$-good w.r.t. $G_{I_N}$; hence there
exists a probe function, say, $\psi_{i_*\hat i} := \psi_{f_{i_*} - f_{\hat i}} \in \Psi_{I_N}$ such that

(47) $\|f_{i_*} - f_{\hat i}\|_p \le \Big|\int \psi_{i_*\hat i}(t)[f_{i_*}(t) - f_{\hat i}(t)]\,dt\Big| + \gamma$.



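Therefore we have on the set $A_\varkappa$

(48)
$$\begin{aligned}
\|f_{i_*} - f_{\hat i}\|_p
&\overset{(a)}{\le} |\Delta_{i_*}(\psi_{i_*\hat i}) - \Delta_{\hat i}(\psi_{i_*\hat i})| + \gamma \\
&\le \big[|\Delta_{i_*}(\psi_{i_*\hat i})| - \varepsilon\varkappa\|\psi_{i_*\hat i}\|_2\big] + \big[|\Delta_{\hat i}(\psi_{i_*\hat i})| - \varepsilon\varkappa\|\psi_{i_*\hat i}\|_2\big] + 2\varepsilon\varkappa\|\psi_{i_*\hat i}\|_2 + \gamma \\
&\overset{(b)}{\le} (M_{i_*} + M_{\hat i}) \max_{\psi} \|\psi\|_q + 2\varepsilon\varkappa \max_{\psi} \|\psi\|_2 + \gamma \\
&\overset{(c)}{\le} 2 M_{i_*} \max_{\psi} \|\psi\|_q + 2\varepsilon\varkappa \max_{\psi} \|\psi\|_2 + \gamma \\
&\overset{(d)}{\le} 2 \max_{\psi} \|\psi\|_q\, \|f - f_{i_*}\|_p + 2\varepsilon\varkappa \max_{\psi} \|\psi\|_2 + \gamma,
\end{aligned}$$

where (a) follows from (46) and (47), (b) is by definition of $M_j$ and because
$\psi_{i_*\hat i} \in \Psi_{I_N}$ [see (16)], (c) follows from (15) and (d) is by (44).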
(3) On the set $A_\varkappa^c$ we have

$\|\hat f - f\|_p\,1(A_\varkappa^c) \le \Big[\|f\|_p + \max_{i \in I_N} \|f_i\|_p\Big]\,1(A_\varkappa^c)$.

Combining this inequality with (48) and (45) we complete the proof. □



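Proof of Corollary 1. By Example 1, $\Psi_{I_N}$ is $(0, p)$-good so that
$\gamma = 0$ in (17). Moreover, $\|\psi\|_q = 1$ for all $\psi \in \Psi_{I_N}$. Since the cardinality of
$\Psi_{I_N}$ equals N(N − 1) we have

$P\Big\{\max_{\psi \in \Psi_{I_N}} \frac{|Z(\psi)|}{\|\psi\|_2} > \varkappa\Big\} \le N^2 \exp\{-\varkappa^2/2\}$.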



It follows from the definition of x and the preceding inequality that N 2 e K ' 1 1 2 > 
5 so that x < v / 2\n(N 2 /5) = y/2\n.(N'*/e). 
If $p \in [1, 2]$, then

$\max \|\psi\|_2 \le \max \|\psi\|_q = 1$.

On the other hand, if $2 < p < \infty$, then in view of (11)

$\max \|\psi\|_2 = \max_{j \in I_N} \frac{\|f_{i_*} - f_j\|_{2p-2}^{p-1}}{\|f_{i_*} - f_j\|_p^{p-1}}$.

Combining these inequalities with the statement of Theorem 1 we come
to (19). □
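
The probability bound used in this proof is a union bound combined with the Gaussian tail inequality. A short derivation is given below as a reader aid; it assumes that $Z(\psi) = \int \psi(t) W(dt)$, so that $Z(\psi)/\|\psi\|_2$ is standard normal, and it writes $\Phi$ for the standard normal distribution function (notation introduced here, not in the original).

```latex
% Union bound over the N(N-1) probe functions, then 1 - \Phi(x) \le \tfrac12 e^{-x^2/2}, x \ge 0.
\begin{align*}
P\Big\{\max_{\psi \in \Psi_{I_N}} \frac{|Z(\psi)|}{\|\psi\|_2} > \varkappa\Big\}
  &\le \sum_{\psi \in \Psi_{I_N}} P\Big\{\frac{|Z(\psi)|}{\|\psi\|_2} > \varkappa\Big\}
   = N(N-1)\, 2\,[1 - \Phi(\varkappa)] \\
  &\le N(N-1)\, e^{-\varkappa^2/2}
   \le N^2 e^{-\varkappa^2/2}.
\end{align*}
```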

5.2. Proof of Theorem 2. Let $B_i$, i = 1, . . . , N, be disjoint subsets of $\mathcal{D}_0$
such that mes($B_i$) = h, ∀i, where 0 < h < 1/N is a given number. Here
mes(·) stands for the Lebesgue measure in $\mathbb{R}^d$. Define $f_i(x) = L\,1_{B_i}(x)$, $i \in I_N$,
and $\mathcal{F}_{I_N} = \{f_i, i \in I_N\}$. Note that $\max_{i \in I_N} \|f_i\|_p \le L$ for all $p \in (2, \infty]$.
If $f \in \mathcal{F}_{I_N}$, then $\min_{i \in I_N} \|f - f_i\|_p = \|f_{i_*} - f\|_p = 0$. Moreover

$\|f_i - f_j\|_p = (2h)^{1/p} L =: s \quad \forall i, j \in I_N,\ i \ne j$,

and

(49)


Ql (f lN ,p) = max ^ ^§g» = (2/^-i/s, 
Q 3 (Fw) = maxfc* = (2^-V2. 

jeJ JV - fi\\2 



It is immediately seen that for a chosen family of functions one has 

Q^i N ,i) = ^^ = (2hy^ v 7 >o, 

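which coincides with (49) for $p = \infty$. Denote $K_p := (2h)^{1/p - 1/2}$, $p \in (2, \infty]$.
Let $\hat f : \mathcal{Y}_\varepsilon \to \mathcal{F}_{I_N}$ be an arbitrary selection rule. We have

(50) $\sup_f E_f \|\hat f - f\|_p \ge \frac{s}{2} \max_{i \in I_N} P_i\Big\{\|\hat f - f_i\|_p \ge \frac{s}{2}\Big\} \ge \frac{s}{2} \max_{i \in I_N} P_i\{\tilde i \ne i\}$,

where $P_i = P_{f_i}$ is the probability measure of the observation $\mathcal{Y}_\varepsilon$ associated with
$f = f_i$, and $\tilde i : \mathcal{Y}_\varepsilon \to \{1, \ldots, N\}$ is the decision rule that selects the function $f_j$
closest to $\hat f$ in the $L_p$-norm.

Let $K(P_i, P_j)$ denote the Kullback-Leibler divergence between $P_i$ and $P_j$:

$K(P_i, P_j) = \frac{1}{2\varepsilon^2} \|f_i - f_j\|_2^2 = \frac{hL^2}{\varepsilon^2} \quad \forall i, j \in I_N,\ i \ne j$.

Then by the Fano inequality [see, e.g., Devroye (1987), Section 5.9]

$\max_{i \in I_N} P_i\{\tilde i \ne i\} \ge 1 - \frac{hL^2\varepsilon^{-2} + \ln 2}{\ln(N - 1)}$.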

Choosing 

£^/5, ,„ , ^ . e 2 



h = h, = j^i K -HN-l)-ln2j>-^HN-l) 

(the last inequality follows from N > 3), we obtain that maxjPjji 7^ i} > 1/6. 
Note that condition e < L(N In N)~ l l 2 implies h* < 1/N so that the sets Bi 
are indeed disjoint, as assumed. Hence (50) yields 

sup ^ f \\f-f\\ p >L(2h^/v 

This completes the proof. 
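
The two norm computations that drive this proof are elementary and can be written out explicitly. The derivation below is added as a reader aid; it only uses that the sets $B_i$ are disjoint with mes($B_i$) = h, together with the standard formula for the Kullback-Leibler divergence between two Gaussian white noise models with noise level $\varepsilon$. The last line simply rearranges the Fano bound and makes no claim about the specific choice of h made in the proof.

```latex
% For i \ne j the difference f_i - f_j equals \pm L on B_i \cup B_j (measure 2h) and 0 elsewhere.
\begin{align*}
\|f_i - f_j\|_p^p &= \int |L\,1_{B_i}(x) - L\,1_{B_j}(x)|^p\,dx = 2h\,L^p
  \quad\Longrightarrow\quad \|f_i - f_j\|_p = (2h)^{1/p} L, \\
K(P_i, P_j) &= \frac{1}{2\varepsilon^2}\,\|f_i - f_j\|_2^2
             = \frac{2hL^2}{2\varepsilon^2} = hL^2\varepsilon^{-2}, \\
1 - \frac{hL^2\varepsilon^{-2} + \ln 2}{\ln(N-1)} \ge \frac16
  &\iff hL^2\varepsilon^{-2} \le \tfrac56 \ln(N-1) - \ln 2 .
\end{align*}
```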

5.3. Proof of Theorem 3. The proof goes along the same lines as the 
proof of Theorem 1; below we indicate only the differences. We use the same 
notation as in the proof of Theorem 1. 

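First we note that for all $i \in I_N$

$M_i\,1(A_\varkappa) = \max_{\psi \in \Psi_{I_N}} \Big\{\frac{1}{\|\psi\|_q} |\Delta_i(\psi)|\Big\}\,1(A_\varkappa) \le \|f - f_i\|_p + \varepsilon\varkappa \max_{\psi \in \Psi_{I_N}} \frac{\|\psi\|_2}{\|\psi\|_q}$.

Because $\Psi_{I_N}$ is $(\gamma, p)$-good, there is a probe function, say, $\psi_{i_*\hat i} \in \Psi_{I_N}$ such
that

$\|f_{i_*} - f_{\hat i}\|_p \le \Big|\int \psi_{i_*\hat i}(t)[f_{i_*}(t) - f_{\hat i}(t)]\,dt\Big| + \gamma$.

Then, similarly to (48), we have on the set $A_\varkappa$

$$\begin{aligned}
\|f_{i_*} - f_{\hat i}\|_p
&\le |\Delta_{i_*}(\psi_{i_*\hat i}) - \Delta_{\hat i}(\psi_{i_*\hat i})| + \gamma \\
&\le 2 M_{i_*} \max_{\psi \in \Psi_{I_N}} \|\psi\|_q + \gamma \\
&\le 2 \max_{\psi \in \Psi_{I_N}} \|\psi\|_q \Big\{\|f - f_{i_*}\|_p + \varepsilon\varkappa \max_{\psi \in \Psi_{I_N}} \frac{\|\psi\|_2}{\|\psi\|_q}\Big\} + \gamma.
\end{aligned}$$

This leads to the inequality (28).

5.4. Proof of Theorem 4. Throughout the proof $\langle\cdot,\cdot\rangle$ denotes the standard
inner product in $L_2(\mathcal{D}_0)$.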

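We start with the following simple observation. Let $f_{i_*}$ be the best estimator
in the family $\mathcal{F}_{I_N}$, that is, $i_* = \arg\min_{i \in I_N} \|f_i - f\|_2$. Since for any $j \in I_N$

$\|f_{i_*} - f\|_2^2 = \|f_j - f\|_2^2 + \|f_{i_*} - f_j\|_2^2 + 2\langle f_{i_*} - f_j, f_j - f\rangle$

and $\|f_{i_*} - f\|_2 \le \|f_j - f\|_2$, then

$\|f_{i_*} - f_j\|_2^2 + 2\langle f_{i_*} - f_j, f_j - f\rangle = 2\langle f_{i_*} - f_j, \tfrac12(f_{i_*} + f_j) - f\rangle = 2\langle f_{i_*} - f_j, u_{i_* j} - f\rangle \le 0 \quad \forall j \in I_N$,

or, equivalently,

(51) $\max_{j \in I_N} \langle \psi_{i_* j}, u_{i_* j} - f\rangle \le 0$.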

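We have

(52)
$$\begin{aligned}
\|f_{\hat i} - f\|_2^2
&= \|f_{i_*} - f\|_2^2 + 2\langle f_{\hat i} - f_{i_*}, \tfrac12(f_{\hat i} + f_{i_*}) - f\rangle \\
&\overset{(a)}{=} \|f_{i_*} - f\|_2^2 + 2\|f_{\hat i} - f_{i_*}\|_2 \langle \psi_{\hat i i_*}, u_{\hat i i_*} - f\rangle \\
&= \|f_{i_*} - f\|_2^2 + 2\|f_{\hat i} - f_{i_*}\|_2 \Big\{\langle \psi_{\hat i i_*}, u_{\hat i i_*}\rangle - \int \psi_{\hat i i_*}(t)\,Y_\varepsilon(dt)\Big\} + 2\|f_{\hat i} - f_{i_*}\|_2\,\varepsilon Z(\psi_{\hat i i_*}) \\
&\overset{(b)}{\le} \|f_{i_*} - f\|_2^2 + 2\|f_{\hat i} - f_{i_*}\|_2 M_{\hat i} + 2\|f_{\hat i} - f_{i_*}\|_2\,\varepsilon Z(\psi_{\hat i i_*}) \\
&\overset{(c)}{\le} \|f_{i_*} - f\|_2^2 + 2\|f_{\hat i} - f_{i_*}\|_2 M_{i_*} + 2\|f_{\hat i} - f_{i_*}\|_2\,\varepsilon Z(\psi_{\hat i i_*}),
\end{aligned}$$

where $Z(\psi) = \int \psi(t) W(dt)$, (a) is by definition of $u_{ij}$ and $\psi_{ij}$, (b) is by
definition of $M_i$, and (c) follows from the definition of $\hat i$.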

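Now we note that

$M_{i_*} \le \max_{j} \langle \psi_{i_* j}, u_{i_* j} - f\rangle + \varepsilon \max_{j} Z(\psi_{j i_*}) \le \varepsilon \max_{j} Z(\psi_{j i_*})$,

where the last inequality is a consequence of (51). Therefore it follows from
(52) and $Z(\psi_{ij}) = -Z(\psi_{ji})$ that

$\|f_{\hat i} - f\|_2^2 \le \|f_{i_*} - f\|_2^2 + 4\|f_{\hat i} - f_{i_*}\|_2\,\varepsilon \max_{i,j} |Z(\psi_{ij})|$.

Hence by the triangle inequality

$\|f_{\hat i} - f\|_2^2 - \|f_{i_*} - f\|_2^2 \le 4\big(\|f_{\hat i} - f\|_2 + \|f_{i_*} - f\|_2\big)\,\varepsilon \max_{i,j} |Z(\psi_{ij})|$

and finally

$\|\hat f - f\|_2 \le \|f_{i_*} - f\|_2 + 4\varepsilon \max_{i,j} |Z(\psi_{ij})|$.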
Taking the expectation we complete the proof. 
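
The passage from the bound on the difference of squared norms to the final inequality is a one-line algebraic step. It is spelled out below as a reader aid; the shorthand a, b, c is introduced here for illustration only and is not part of the original notation.

```latex
% Shorthand (not in the original):
%   a = \|f_{\hat i} - f\|_2,   b = \|f_{i_*} - f\|_2,   c = \varepsilon \max_{i,j} |Z(\psi_{ij})|.
\begin{align*}
a^2 - b^2 &\le 4\,\|f_{\hat i} - f_{i_*}\|_2\, c \le 4\,(a + b)\,c
   && \text{(triangle inequality: } \|f_{\hat i} - f_{i_*}\|_2 \le a + b\text{)} \\
(a - b)(a + b) &\le 4\,(a + b)\,c \\
a &\le b + 4c
   && \text{(divide by } a + b > 0\text{; the case } a + b = 0 \text{ is trivial).}
\end{align*}
```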
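5.5. Proofs of Lemma 1 and Theorem 5.

Proof of Lemma 1. Let $g \in G_\Lambda$, that is, for some $\lambda, \nu \in \Lambda$ one has
$g = \sum_{i=1}^N (\lambda_i - \nu_i) f_i$. There exist $\tilde\lambda, \tilde\nu \in \Lambda_\eta$ such that $|\lambda - \tilde\lambda|_1 \le \eta$ and $|\nu - \tilde\nu|_1 \le \eta$.
Define $\tilde g = \sum_{i=1}^N (\tilde\lambda_i - \tilde\nu_i) f_i$; by definition, $\tilde g \in G_{\Lambda_\eta}$. Because $\Psi$ is $(0, p)$-good
with respect to $G_{\Lambda_\eta}$, there exists $\psi = \psi_{\tilde g} \in \Psi$ such that

$\int \psi_{\tilde g}(t)\,\tilde g(t)\,dt = \|\tilde g\|_p$.

With this representer $\psi_{\tilde g}$ applied to $g \in G_\Lambda$ we obtain

$\int \psi_{\tilde g}(t)\,g(t)\,dt = \|\tilde g\|_p + \int \psi_{\tilde g}(t)[g(t) - \tilde g(t)]\,dt$,

and therefore

$$\begin{aligned}
\Big|\|g\|_p - \int \psi_{\tilde g}(t)\,g(t)\,dt\Big|
&\le \big|\|g\|_p - \|\tilde g\|_p\big| + \Big|\int \psi_{\tilde g}(t)[g(t) - \tilde g(t)]\,dt\Big| \\
&\le \|g - \tilde g\|_p + \|\psi_{\tilde g}\|_q \|g - \tilde g\|_p \\
&= (1 + \|\psi_{\tilde g}\|_q)\,\|g - \tilde g\|_p.
\end{aligned}$$

To complete the proof it is sufficient to note that

$g(t) - \tilde g(t) = \sum_{i=1}^N (\lambda_i - \tilde\lambda_i) f_i(t) - \sum_{i=1}^N (\nu_i - \tilde\nu_i) f_i(t)$;

hence

$\|g - \tilde g\|_p \le \sum_{i=1}^N \big[|\lambda_i - \tilde\lambda_i| + |\nu_i - \tilde\nu_i|\big] \|f_i\|_p \le 2L\eta$. □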



Proof of Theorem 5. The proof goes along the same lines as the
proof of Theorem 1; here we indicate only the main differences. First we
note that similarly to (44) one has

$\max_{\psi \in \Psi_{\Lambda_\varepsilon}} \frac{1}{\|\psi\|_q}\big[|\Delta_\lambda(\psi)| - \varepsilon\varkappa\|\psi\|_2\big]\,1(A_\varkappa) \le \|f - F_\lambda\|_p \quad \forall \lambda \in \Lambda$,

where $A_\varkappa$ is the event defined in (42) with $\max_{\psi \in \Psi_{I_N}}$ replaced by $\max_{\psi \in \Psi_{\Lambda_\varepsilon}}$.

Define $\lambda_* = \arg\min_{\lambda} \|f - F_\lambda\|_p$. The main difference with the proof of
Theorem 1 is that now the set of probe functions $\Psi_{\Lambda_\varepsilon}$ is $(\gamma, p)$-good with
respect to $G_\Lambda$ with $\gamma$ given by (32), and the inequality (47) holds for some
representer associated with a point $\nu \in \Lambda_\varepsilon$. In contrast to the proof of Theorem 1, in
general $\nu \ne \lambda_*$, because $\lambda_*$ does not necessarily belong to $\Lambda_\varepsilon$. This implies
that in the resulting oracle inequality we have maxima over $\psi \in \Psi_{\Lambda_\varepsilon}$, and
not over the subset of $\Psi_{\Lambda_\varepsilon}$ related to $\lambda_*$. All other details of the proof remain
unchanged. □